
Conversation

mjosaarinen

Summary:
RV64V support (RISC-V Vector extension 1.0, which is available on newer application-class silicon).

Steps:
If your pull request consists of multiple sequential changes, please describe them here:

Performed local tests:

  • lint passing
  • tests all passing
  • tests bench passing
  • tests cbmc passing

Do you expect this change to impact performance: Yes/No
Yes (RISC-V only).

If yes, please provide local benchmarking results.
Roughly 2.5x perf on silicon with vector hardware.

@mjosaarinen mjosaarinen requested a review from a team as a code owner May 19, 2025 14:57
@mjosaarinen mjosaarinen force-pushed the rv64v-dev branch 3 times, most recently from 0637941 to 6b4f845 on May 19, 2025 18:35
@hanno-becker
Contributor

@mjosaarinen If you have nix setup, running autogen should hopefully resolve the linting issues.

@mjosaarinen mjosaarinen force-pushed the rv64v-dev branch 2 times, most recently from dcab861 to 38e79bb on May 19, 2025 20:40
@rod-chapman
Contributor

Note that on the "fastntt3" branch, there are layer-merged implementations of the NTT and INTT that are highly amenable to auto-vectorization with compilers like GCC 14. Benchmarks of that code on an RV64v target were encouraging, so might provide some inspiration for a fully vectorized, hand-written back-end.

@mjosaarinen
Author

Note that on the "fastntt3" branch, there are layer-merged implementations of the NTT and INTT that are highly amenable to auto-vectorization with compilers like GCC 14. Benchmarks of that code on an RV64v target were encouraging, so might provide some inspiration for a fully vectorized, hand-written back-end.

Yeah, you can easily double the speed with autovectorization alone, and some Google folks were of the opinion that they wanted to rely on that entirely in BoringSSL (RISC-V Android etc.), rather than maintain a hand-optimized version. The resulting code is pretty wild; I looked at it when considering RISC-V ISA extensions (see slide 17, for example, in https://mjos.fi/doc/20240325-rwc-riscv.pdf). It was almost "too good" -- I suspect that Google has used those NTTs as a microbenchmark when developing LLVM autovectorizers :)
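For concreteness, the kind of loop autovectorizers handle well is a plain butterfly layer with no aliasing and unit stride. The sketch below is illustrative only (the names and the naive '%' reduction are mine, not taken from either code base); built with gcc -O3 -march=rv64gcv, loops of this shape typically compile to strip-mined RVV vector code.

```c
#include <stdint.h>

#define MLKEM_Q 3329

/* One NTT butterfly layer over a coefficient array. The simple loop
 * structure (no pointer aliasing, unit stride) is what makes this
 * amenable to auto-vectorization. The naive '%' reduction is for
 * illustration; real code uses Montgomery or Barrett arithmetic. */
static void butterfly_layer(int32_t *a, int n, int len, int32_t zeta)
{
    for (int start = 0; start < n; start += 2 * len)
    {
        for (int j = start; j < start + len; j++)
        {
            int32_t t = (int32_t)((int64_t)zeta * a[j + len] % MLKEM_Q);
            a[j + len] = (a[j] - t) % MLKEM_Q;
            a[j] = (a[j] + t) % MLKEM_Q;
        }
    }
}
```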

@mjosaarinen
Author

@mjosaarinen If you have nix setup, running autogen should hopefully resolve the linting issues.

Yeah, sorry for abusing your CI like that (I wasn't expecting it to be that extensive), I could have just read the documentation. I'll set up this nix thing.

@hanno-becker
Contributor

hanno-becker commented May 20, 2025

@mjosaarinen Sorry, we should have pointed that out earlier. With the nix environment, you should not need to waste any more time making the linter happy. Just run format && autogen before pushing.

@hanno-becker
Contributor

hanno-becker commented May 27, 2025

@mkannwischer @mjosaarinen The RV functional tests in the ci-cross shell don't seem to pass.

% nix develop --extra-experimental-features 'nix-command flakes' .#ci-cross
% CROSS_PREFIX=riscv64-unknown-linux-gnu- make func OPT=1 AUTO=1 -j32
% EXEC_WRAPPER=qemu-riscv64 make run_func_512 -j32 OPT=1 AUTO=1
qemu-riscv64 test/build/mlkem512/bin/test_mlkem512
ERROR (test/test_mlkem.c,41)
ERROR (test/test_mlkem.c,225)
make: *** [Makefile:53: run_func_512] Error 1

Same for OPT=0.

Could you investigate?

Contributor

@hanno-becker hanno-becker left a comment

The RV64 cross tests in CI are failing. I don't know if this is an issue with the code or the CI setup, but it needs looking into.

@mjosaarinen
Author

The RV64 cross tests in CI are failing. I don't know if this is an issue with the code or the CI setup, but it needs looking into.

It looks like some things are running fine, and some things have build problems. I suspect that the platform auto-detection code (in the C header files + makefiles) is to blame (does failing no_opt mean "no optimization"?). It's quite a complicated CI -- can you tell me what would be the best way to debug this stage of the CI build process locally?

@hanno-becker
Contributor

hanno-becker commented May 27, 2025

It's a quite complicated CI -- can you tell me what would be the best way to debug this stage of the CI build process locally?

Sure. If you have nix installed, then it's exactly what I posted above: You open the nix shell with

% nix develop --extra-experimental-features 'nix-command flakes' .#ci-cross

Building using cross compiler:

% CROSS_PREFIX=riscv64-unknown-linux-gnu- make func AUTO=1 

And then running:

% qemu-riscv64 test/build/mlkem512/bin/test_mlkem512
% qemu-riscv64 test/build/mlkem768/bin/test_mlkem768
% qemu-riscv64 test/build/mlkem1024/bin/test_mlkem1024

The base command line used by AUTO=1 above is:

riscv64-unknown-linux-gnu-gcc -Wall -Wextra -Werror=unused-result -Wpedantic -Werror 
  -Wmissing-prototypes -Wshadow -Wpointer-arith -Wredundant-decls -Wno-long-long 
  -Wno-unknown-pragmas -Wno-unused-command-line-argument -O3 -fomit-frame-pointer 
  -std=c99 -pedantic -MMD  -march=rv64gcv_zvl256b 
  -DMLK_FORCE_RISCV64 -DMLK_CONFIG_USE_NATIVE_BACKEND_ARITH 
  -DMLK_CONFIG_USE_NATIVE_BACKEND_FIPS202 -DMLK_CONFIG_PARAMETER_SET=1024 

What is confusing to me is that this happens regardless of whether OPT=0 or OPT=1 (it defaults to OPT=0). If OPT=0, the new files aren't even compiled.

If you don't want to be bothered by auto-settings, we can for now set AUTO=0 and manually specify everything on the command line using CC and CFLAGS:

% CFLAGS="-march=rv64gcv_zvl256b " CC=riscv64-unknown-linux-gnu-gcc make func AUTO=0
...
% qemu-riscv64 test/build/mlkem1024/bin/test_mlkem1024
ERROR (test/test_mlkem.c,49)
ERROR (test/test_mlkem.c,225)

@hanno-becker
Contributor

hanno-becker commented May 27, 2025

@mkannwischer @mjosaarinen

It looks like the issue is solely due to -march=rv64gcv_zvl256b. Even on main, if you do:

CFLAGS="-march=rv64gcv_zvl256b" CC=riscv64-unknown-linux-gnu-gcc make func -j32 AUTO=0 OPT=0
qemu-riscv64 test/build/mlkem512/bin/test_mlkem512

it'll fail with

ERROR (test/test_mlkem.c,41)
ERROR (test/test_mlkem.c,225)

Probably some configuration is missing in the invocation of qemu-riscv64? Nowhere do we say that we're emulating a system with 256-bit vector extension.

@hanno-becker
Contributor

qemu-riscv64 -cpu rv64,v=true,vlen=256 works. @mjosaarinen is that the right config?

@mjosaarinen
Author

qemu-riscv64 -cpu rv64,v=true,vlen=256 works. @mjosaarinen is that the right config?

Hmm. I'm not sure if qemu is able to pull in the standard runtime and dynamic libraries that match the "vector ABI" (the calling convention changes somewhat with vector registers). I had to link the test programs with -static when I was testing the code. I was testing with spike and vector silicon myself; sorry, I have not yet looked at the qemu environment used by CI.

Anyway, the code assumes the "vector 1.0" extension, 256-bit VLEN, and the standard "gc" stuff, so that looks correct. I left the Keccak instruction stuff out.

@hanno-becker
Contributor

I adjusted the CI. Most tests pass, but there are some issues in the ACVP tests and the monobuild examples. The ACVP one seems to be an issue in the test script (not expecting parameters in the exec-wrapper), while the monobuild failure seems to be an architecture confusion. I'll take a look.

@hanno-becker hanno-becker mentioned this pull request May 28, 2025
@hanno-becker hanno-becker force-pushed the rv64v-dev branch 3 times, most recently from 951ab19 to 0c4d7e8 on May 28, 2025 12:19
@hanno-becker hanno-becker reopened this Sep 24, 2025
@hanno-becker
Contributor

@mjosaarinen Whoops, apologies for the accidental close.

Contributor

@hanno-becker hanno-becker left a comment

We're close I think 👍

  • The initial reduction can be replaced by the final normalization, as in the AArch64 and x86_64 backends.
  • I think the conditional subtraction in the invNTT doesn't work: test #1205 witnesses an overflow that ensues on a constant input of value -673. @mjosaarinen Can you remove the csub from the butterflies and instead add the necessary reductions where needed? It's probably the same as in the AArch64 invNTT, see https://github.com/pq-code-package/mlkem-native/blob/main/dev/aarch64_clean/src/intt.S#L246 and following.
  • We should extend src/sys.h with the necessary system capability and guard the backend accordingly -- see e.g. how it's done for AVX2. Otherwise, one cannot target hosts with heterogeneous capabilities with a single build (e.g. RV64 CPU with/without vector extension, or with vector extension but different length). Probably we want a capability for "Vector Extension of VL=256". What's the correct name here? Once we have that, we should extend CI to test the library on a CPU a) without vector extension, b) with vector extension, but different VL -- in both cases, it should just fall back to the C code, of course. I can do that part, though.
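For reference, the kind of reduction that can replace the per-butterfly conditional add/subtract is a signed Barrett reduction, as used elsewhere in the code base. A scalar model for q = 3329 (how this maps onto RVV lane operations such as vmulh/vsub is my assumption, not the PR's actual code):

```c
#include <stdint.h>

#define MLKEM_Q 3329

/* Signed Barrett reduction: returns r congruent to a mod q with
 * r in (-q/2, q/2]. The constant v = round(2^26 / q) = 20159. */
static int16_t barrett_reduce(int16_t a)
{
    const int32_t v = ((1 << 26) + MLKEM_Q / 2) / MLKEM_Q; /* 20159 */
    int16_t t = (int16_t)(((int32_t)v * a + (1 << 25)) >> 26);
    return (int16_t)(a - t * MLKEM_Q);
}
```

Applying this at a few selected points in the invNTT keeps coefficients within int16_t range without conditional corrections in every butterfly.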

@hanno-becker
Contributor

hanno-becker commented Oct 12, 2025

@mjosaarinen Regarding e15004e: Is there anything fundamentally different about the instructions available in RV64 that prevents the 'normal' signed arithmetic to be used throughout the invNTT? I don't understand the need to work with unsigned coefficients yet.

@mjosaarinen
Author

@mjosaarinen Regarding e15004e: Is there anything fundamentally different about the instructions available in RV64 that prevents the 'normal' signed arithmetic to be used throughout the invNTT? I don't understand the need to work with unsigned coefficients yet.

Well, the invNTT takes in signed values but internally keeps them in the canonical range [0,q-1] throughout (I don't think FIPS 203 requires signed integers anywhere). This could be seen as a precaution, but it is not so bad for performance: since there is now an initial reduction, we can remove the final normalization, which is relatively expensive: b31855d
The RVV ISA itself is not restrictive -- in many ways, it has almost too much choice; the core modular multiplication can be done in a couple of different ways (e.g., widening 16->32 bit or non-widening), the best of which may depend on the micro-architecture. Due to a lack of concrete hardware benchmarking targets, it is difficult to say which one will be best.
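The two multiplication strategies mentioned above can be modeled in scalar C. Both compute a Montgomery product a*b*2^-16 mod q; the widening variant corresponds to forming the full 32-bit product (vwmul in RVV, then narrowing), the non-widening one to separate low/high half products (vmul/vmulh). The function names and the RVV mapping here are illustrative assumptions, not the PR's actual code:

```c
#include <stdint.h>

#define MLKEM_Q 3329
#define MLKEM_QINV -3327 /* q^-1 mod 2^16 */

/* Widening model: full 16x16->32 product, then Montgomery reduction. */
static int16_t fqmul_widening(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;                    /* 32-bit product */
    int16_t m = (int16_t)((int16_t)p * MLKEM_QINV);
    return (int16_t)((p - (int32_t)m * MLKEM_Q) >> 16);
}

/* Non-widening model: keep 16-bit lanes, use separate low/high halves.
 * The low halves of p and m*q cancel exactly, so the high-half
 * difference equals the Montgomery result. */
static int16_t fqmul_nonwidening(int16_t a, int16_t b)
{
    int16_t lo = (int16_t)(a * b);                  /* vmul.vv analogue */
    int16_t hi = (int16_t)(((int32_t)a * b) >> 16); /* vmulh.vv analogue */
    int16_t m = (int16_t)(lo * MLKEM_QINV);
    int16_t mh = (int16_t)(((int32_t)m * MLKEM_Q) >> 16);
    return (int16_t)(hi - mh);
}
```

The non-widening variant needs one extra multiply but avoids widening/narrowing conversions, so which is faster likely depends on the micro-architecture, as noted above.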

@hanno-becker
Contributor

hanno-becker commented Oct 12, 2025

@mjosaarinen Thanks, yes, you are right that this is a valid strategy.

On the other hand, we don't require normalized outputs and only normalise as part of the serialisation at the end (nowhere else). This works based on a moderate output bound of 8q. So we could just get rid of any final normalisation in the invNTT and it would be fine, assuming a selected set of reductions at a few coefficients in the middle of the invNTT -- but I would strongly suspect those to be much less expensive than a cadd+csub in every butterfly.

Would you mind trying this? Again, not saying your approach is wrong, but we don't really gain anything from it because the rest of the code does not need stronger output bounds.

@mjosaarinen
Author

@mjosaarinen Thanks, yes, you are right that this is a valid strategy.

On the other hand, we don't require normalized outputs and only normalise as part of the serialisation at the end (nowhere else). This works based on a moderate output bound of 8q. So we could just get rid of any final normalisation in the invNTT and it would be fine, assuming a selected set of reductions at a few coefficients in the middle of the invNTT -- but I would strongly suspect those to be much less expensive than a cadd+csub in every butterfly.

Would you mind trying this? Again, not saying your approach is wrong, but we don't really gain anything from it because the rest of the code does not need stronger output bounds.

Well, I tried a little bit of that (and the code can pass the tests with some of those modifications), but I realized that I was just guessing again at what that "selected set of reductions" could be. Let's leave that for later, ok?

You know, an appropriate thing would be to work out the actual bounds systematically -- for every step and (proven-correct) arithmetic "gadget" -- and always have the proven output properties as input assumptions further down the line.

Anyway, at present, this is not a particularly optimized implementation -- there are some important things to address in relation to performance, like Keccak.

@rod-chapman
Contributor

@mjosaarinen Hanno and I have done detailed analysis of the various reduction strategies for the INTT. See the "fastntt3" branch, for example, which does the INTT (in C) using a "1,1,2,3" layer-merging strategy. It also turns out that a much better reduction strategy can be achieved if we can prove that the inputs to the INTT have magnitude bounded by INT16_MAX/2. We have proved this for the C code (on the "refine_bounds" branch) and the AArch64 code, but need to extend that proof to cover x86_64, which will require some modification of the basemul code. Perhaps we will return to this another time?

@hanno-becker
Contributor

@mjosaarinen Let's do as you say. I think one can follow the Arm backend here, but if that's so -- I will try it out to be sure -- it can be a follow-up.

Similarly, you mentioned you aim for a VL-agnostic version. Can this be a follow-up too?

@hanno-becker
Contributor

You know, an appropriate thing would be to work out the actual bounds systematically -- for every step and (proven-correct) arithmetic "gadget" -- and always have the proven output properties as input assumptions further down the line.

Yes! For the record, this is exactly what the HOL-Light proofs for Arm are doing. You will see tight bounds in the HOL-Light specs that would be hard to verify on paper and impossible by eye. However, the frontend never relies on those bounds. The bounds required by the frontend can so far be audited by eye. This means we go as far into lazy reduction as can easily be annotated and followed in the code, but not further. That may leave some unnecessary reduction in the code, but at least for the invNTT the returns of more and more lazy reduction diminish quickly.

@mjosaarinen
Author

mjosaarinen commented Oct 12, 2025

@mjosaarinen Let's do as you say. I think one can follow the Arm backend here, but if that's so -- I will try it out to be sure -- it can be a follow-up.

Similarly, you mentioned you aim for a VL-agnostic version. Can this be a follow-up too?

I think it can be a follow-up - there is always a fallback to the C back-end.

The thing about that is: while some parts can easily be made VLEN-agnostic using the "strip-mining" approach (discussed in the RISC-V manual), it seems to me that for an efficient NTT/INTT, one would need separate versions for VLEN in { 128, 256, 512, 1024, perhaps others } due to the actual register allocation.

Btw, in RVV, the vector length VL is different from VLEN -- VL holds the dynamic vector length (in elements, like bytes or floats) while VLEN is the hardware size (in bits) of each of the 32 vector registers. Vectors (VL) do not have to be a power of two in size and can span up to 8 vector registers (LMUL). The maximum hardware register size allowed by the spec is VLEN=65536 bits. Due to LMUL, one can technically do a lot of stuff in a single instruction (operate on 8*256 = 2048 bits even with VLEN=256), so that's another thing to explore. This code uses only LMUL=1 and LMUL=2.
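The strip-mining pattern from the RISC-V manual can be modeled in scalar C. Here vsetvl() stands in for the vsetvli instruction, which grants up to VLMAX = VLEN/SEW * LMUL elements per iteration; the constants below assume VLEN=256, 16-bit elements, and LMUL=2 (so VLMAX=32), and all names are mine, not from the PR:

```c
#include <stddef.h>
#include <stdint.h>

enum { VLEN = 256, SEW = 16, LMUL = 2, VLMAX = VLEN / SEW * LMUL }; /* 32 */

/* Model of vsetvli: grant up to VLMAX elements per strip. */
static size_t vsetvl(size_t avl) { return avl < VLMAX ? avl : VLMAX; }

/* VLEN-agnostic polynomial addition via strip-mining: the loop asks
 * the hardware how many elements it may process, then advances by
 * that amount. In real RVV the inner loop is a single vadd.vv. */
static void poly_add(int16_t *r, const int16_t *a, const int16_t *b, size_t n)
{
    while (n > 0)
    {
        size_t vl = vsetvl(n); /* vsetvli t0, a0, e16, m2 */
        for (size_t i = 0; i < vl; i++)
            r[i] = (int16_t)(a[i] + b[i]);
        r += vl; a += vl; b += vl; n -= vl;
    }
}
```

This pattern works for element-wise operations regardless of VLEN; as noted above, the NTT/INTT are harder because their register allocation depends on the physical register size.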

@hanno-becker
Contributor

hanno-becker commented Oct 12, 2025

@mjosaarinen Very interesting, thank you. I should say "VLEN" then, not "VL".

What else needs doing then for this PR, from your perspective? I will aim to go through everything again, probably later or tomorrow.

Do you have input on my note above on runtime detection of vector extension support?

We should extend src/sys.h with the necessary system capability and guard the backend accordingly -- see e.g. how it's done for AVX2. Otherwise, one cannot target hosts with heterogeneous capabilities with a single build (e.g. RV64 CPU with/without vector extension, or with vector extension but different length). Probably we want a capability for "Vector Extension of VL=256". What's the correct name here? Once we have that, we should extend CI to test the library on a CPU a) without vector extension, b) with vector extension, but different VL -- in both cases, it should just fall back to the C code, of course. I can do that part, though.

@hanno-becker
Contributor

0-diff rebase to remove merge commits.

hanno-becker and others added 6 commits October 12, 2025 19:29
…sics.)

Signed-off-by: Markku-Juhani O. Saarinen <[email protected]>
Signed-off-by: Matthias J. Kannwischer <[email protected]>
Signed-off-by: Hanno Becker <[email protected]>
- Don't skip over 'opt' tests in CI when testing RV64
- Configure qemu to use 256-bit VL.

Once we support other VLs, the invocation of qemu needs
to be generalized accordingly.

Signed-off-by: Hanno Becker <[email protected]>
Signed-off-by: Matthias J. Kannwischer <[email protected]>
Signed-off-by: Hanno Becker <[email protected]>
@hanno-becker
Contributor

Did some history cleanup, and re-added bounds assertions aiming to check that coefficients stay unsigned-canonical.

@mjosaarinen
Author

mjosaarinen commented Oct 12, 2025

Did some history cleanup, and re-added bounds assertions aiming to check that coefficients stay unsigned-canonical.

It no longer even compiles (this file is needed):

  CC      test/build/mlkem1024/unit/mlkem/src/native/riscv64/src/rv64v_poly.c.o
mlkem/src/native/riscv64/src/rv64v_poly.c:15:10: fatal error: rv64v_settings.h: No such file or directory
   15 | #include "rv64v_settings.h"
      |          ^~~~~~~~~~~~~~~~~~
compilation terminated.

Signed-off-by: Markku-Juhani O. Saarinen <[email protected]>
@mjosaarinen
Author

@mjosaarinen Very interesting, thank you. I should say "VLEN" then, not "VL".

What else needs doing then for this PR, from your perspective? I will aim to go through everything again, probably later or tomorrow.

Well, I was thinking of running some benchmarks on real RVV hardware (SpacemiT X60) on Monday-Tuesday and perhaps tweaking some things a bit based on the results.

Do you have input on my note above on runtime detection of vector extension support?

We should extend src/sys.h with the necessary system capability and guard the backend accordingly -- see e.g. how it's done for AVX2. Otherwise, one cannot target hosts with heterogeneous capabilities with a single build (e.g. RV64 CPU with/without vector extension, or with vector extension but different length). Probably we want a capability for "Vector Extension of VL=256". What's the correct name here? Once we have that, we should extend CI to test the library on a CPU a) without vector extension, b) with vector extension, but different VL -- in both cases, it should just fall back to the C code, of course. I can do that part, though.

From a practical viewpoint, it should be OK to have the existence of RVV (yes/no) as a build-time flag. After all, I understand that the Android and Ubuntu kernel & userland will be built for an RVA23-like processor profile and wouldn't work without Vector. Now, the physical vector length (VLEN) is another matter; it is intended to be dynamic and is supposed to be handled by programs directly. RVV code can do that with 1 instruction (it's designed to be fast) and without syscalls etc.
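If a runtime check along the lines of the AVX2 pattern is still wanted, a possible sketch on RISC-V Linux uses AT_HWCAP, where the kernel advertises single-letter ISA extensions (bit 'V' - 'A' for the vector extension). The function name and placement are hypothetical, not from the PR:

```c
#if defined(__linux__)
#include <sys/auxv.h> /* getauxval, AT_HWCAP */
#endif

/* Hypothetical runtime capability check for the RISC-V vector
 * extension. Returns 1 if RVV is available, 0 otherwise; on
 * non-RVV hosts the caller would fall back to the C backend. */
static int mlk_sys_has_rvv(void)
{
#if defined(__riscv) && defined(__linux__)
    /* Single-letter ISA extensions are exported via AT_HWCAP;
     * bit ('V' - 'A') indicates the vector extension. */
    return (int)((getauxval(AT_HWCAP) >> ('V' - 'A')) & 1);
#else
    return 0;
#endif
}
```

Note that AT_HWCAP only answers yes/no; newer kernels also expose the more detailed __riscv_hwprobe interface, which can report the actual VLEN and would be needed for a "vector with VLEN=256" capability as discussed above.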

Labels

benchmark (this PR should be benchmarked in CI), RV64